Published

Tuesday, 16/01/2024

1 Basic Statistical Test

In this session, several tests will be covered

  1. Chi-Square Test
  2. Independent t-test
  3. Linear Regression
  4. Logistic Regression

There will also two part, on how to conduct the tests

  1. Basic, standard way
  2. Using package: easier, tidier, prettier

2 Chi-square Test

2.1 Background

  • test for significant relationship between two categorical variables
  • test assumptions:
    • random sampling
    • independence
    • mutually exclusive
    • <20% of cell have expected count <5

2.2 Conduct Chi-square Test

Objective: To test whether employment status and gender had significant association

  1. Import dataset asthmads_clean.sav
Code
```{r}
library(tidyverse)
library(haven)

asthmads <- read_sav("asthmads_clean.sav") %>% 
  as_factor()
asthmads
```
  1. Creata Crosstabulation
Code
```{r}
with(asthmads, table(Gender, WorkStatus))
```
        WorkStatus
Gender   Unemployed Employed
  Female         47       17
  Male           33       53
Code
```{r}
xtabs(~ Gender + WorkStatus, asthmads)
```
        WorkStatus
Gender   Unemployed Employed
  Female         47       17
  Male           33       53
Code
```{r}
xtabs(~ Gender + WorkStatus, asthmads) %>% 
  prop.table(., margin = 2)
```
        WorkStatus
Gender   Unemployed  Employed
  Female  0.5875000 0.2428571
  Male    0.4125000 0.7571429
Code
```{r}
library(tidyverse)
library(janitor)

asthmads %>%
  count(Gender, WorkStatus)
```
Code
```{r}
asthmads %>%
  count(Gender, WorkStatus) %>%
  pivot_wider(names_from = WorkStatus, values_from = n, 
              values_fill = list(n = 0))
```
Code
```{r}
asthmads %>%
  count(Gender, WorkStatus) %>%
  pivot_wider(names_from = WorkStatus, values_from = n, 
              values_fill = list(n = 0)) %>%
  janitor::adorn_totals(c("row", "col"))
```
Code
```{r}
asthmads %>%
  tabyl(Gender, WorkStatus)
```
Code
```{r}
asthmads %>%
  tabyl(Gender, WorkStatus) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting()
```
  1. Conduct chi-square test with chisq.test(_) function
Code
```{r}
with(asthmads, table(Gender, WorkStatus)) %>% 
  chisq.test(., correc = F)
```

    Pearson's Chi-squared test

data:  .
X-squared = 18.128, df = 1, p-value = 2.066e-05

2.3 Package gtsummary::

gtsummary:: package is my favourite package, providing an elegant and flexible way to create publication-ready analytical and summary tables.

Code
```{r}
library(gtsummary)

asthmads %>% 
  select(Gender, WorkStatus) %>% 
  tbl_summary(by = WorkStatus)
```
Characteristic Unemployed, N = 801 Employed, N = 701
Gender

    Female 47 (59%) 17 (24%)
    Male 33 (41%) 53 (76%)
1 n (%)
Code
```{r}
asthmads %>% 
  select(Gender, WorkStatus) %>% 
  tbl_summary(by = WorkStatus) %>% 
  add_p()
```
Characteristic Unemployed, N = 801 Employed, N = 701 p-value2
Gender

<0.001
    Female 47 (59%) 17 (24%)
    Male 33 (41%) 53 (76%)
1 n (%)
2 Pearson’s Chi-squared test

3 Independent t-test

3.1 Background

  • test for significant difference of normally distributed continuous variable between two group
  • test’s assumption
    • random sampling
    • independence
    • normal distributed
    • equal variance (homoscedasticity)

3.2 Conduct Independent t-test

Objective: to compare body height between gender

  1. import dataset
  • in this example, same dataset (i.e. asthmads) is used.
  1. Confirm data distribution
Code
```{r}
asthmads %>% 
  ggplot(aes(x = Ht_m, fill = Gender)) +
  geom_density(alpha = .5) +
  theme_bw()
```

  1. Calculate mean & sd of height for each gender
Code
```{r}
asthmads %>% 
  group_by(Gender) %>% 
  summarise(mean = mean(Ht_m, na.rm = T),
            sd = sd(Ht_m, na.rm = T))
```
Code
```{r}
asthmads %>% 
  group_by(Gender) %>% 
  descr(Ht_m)
```
Descriptive Statistics  
Ht_m by Gender  
Data Frame: asthmads  
N: 64  

                    Female     Male
----------------- -------- --------
             Mean     1.50     1.74
          Std.Dev     0.11     0.09
              Min     1.29     1.51
               Q1     1.42     1.69
           Median     1.48     1.74
               Q3     1.56     1.79
              Max     1.79     1.95
              MAD     0.10     0.07
              IQR     0.13     0.10
               CV     0.07     0.05
         Skewness     0.51    -0.05
      SE.Skewness     0.30     0.26
         Kurtosis    -0.28     0.03
          N.Valid    64.00    86.00
        Pct.Valid   100.00   100.00
Code
```{r}
library(rstatix)

asthmads %>% 
  group_by(Gender) %>% 
  get_summary_stats(Ht_m)
```
  1. Conduct independent t-test with t.test(_) function
Code
```{r}
t.test(Ht_m ~ Gender, asthmads)
```

    Welch Two Sample t-test

data:  Ht_m by Gender
t = -14.12, df = 119.43, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -0.2703944 -0.2038862
sample estimates:
mean in group Female   mean in group Male 
            1.503906             1.741047 
Important

In R, many functions require a formula parameter, especially in statistical modelling.

  • The general form of a formula is outcome ~ predictors, data,
    • outcome = dependent variable
    • predictors = independent variables.
  • This formula structure is used in various functions, such as linear modelling.
  • For a t-test, which compares means across groups,
    • the formula formed by outcome ~ group, data
    • group = categorical variable = groups

3.3 Package gtsummary::

Code
```{r}
library(gtsummary)

asthmads %>% 
  select(Gender, Ht_m) %>% 
  tbl_summary(by = Gender,
              statistic = list(all_continuous() ~ "{mean} ({sd})"))

asthmads %>% 
  select(Gender, Ht_m) %>% 
  tbl_summary(by = Gender,
              statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% 
  add_p(test = list(all_continuous() ~ "t.test"))
```
Characteristic Female, N = 641 Male, N = 861
Ht_m 1.50 (0.11) 1.74 (0.09)
1 Mean (SD)
Characteristic Female, N = 641 Male, N = 861 p-value2
Ht_m 1.50 (0.11) 1.74 (0.09) <0.001
1 Mean (SD)
2 Welch Two Sample t-test
Code
```{r}
library(broom)

t.test(Ht_m ~ Gender, asthmads) %>% 
  tidy()
```
Code
```{r}
library(rstatix)

asthmads %>% 
  t_test(Ht_m ~ Gender, detailed = T) 
```

4 Linear Regression

4.1 Background

  • test for significant relationship between independent variable (predictors) with dependent variable (outcome)
    • outcome is numerical variable
    • predictor can be categorical or numerical variable
  • test’s assumptions
    • continuous outcome
    • linearity between IV and DV
    • independence of error
    • normality of residual
    • homoscedascity of residual
    • no multicollinearity (multiple linear regression)

4.2 Conduct Linear Regression

Objective: Find significant predictors of BMI_Changes

Note

BMI_Changes was not available, and need to calculate

  1. import dataset asthmads_clean.sav
  • same dataset used previously
Code
```{r}
asthmads
```
Code
```{r}
asthmads <- asthmads %>% 
  mutate(BMI_Changes = BMI_Post - BMI_Pre)
asthmads
```
  1. Conduct Linear Regression (Simple Linear Regression)
Code
```{r}
slm_PA_HW <- lm(BMI_Changes ~ PA_HW, asthmads)
summary(slm_PA_HW)
```

Call:
lm(formula = BMI_Changes ~ PA_HW, data = asthmads)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.22966 -0.35884  0.02248  0.35711  1.39559 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.84559    0.06743  -27.37   <2e-16 ***
PA_HW       -0.31476    0.01871  -16.82   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5182 on 148 degrees of freedom
Multiple R-squared:  0.6567,    Adjusted R-squared:  0.6544 
F-statistic: 283.1 on 1 and 148 DF,  p-value: < 2.2e-16
Important

In R, many functions require a formula parameter, especially in statistical modelling.

  • The general form of a formula is outcome ~ predictors, data,
    • outcome = dependent variable
    • predictors = independent variables.
  • if there were several predictors, add with + operator
    • e.g. weight ~ height + age, my_data
  • in multiple linear regression, we use * operator for interaction
    • e.g. weight ~ height + age + height*age, my_data
  1. Test Assumptions
Code
```{r}
plot(slm_PA_HW)
```

4.3 Package gtsummary::

Code
```{r}
asthmads %>% 
  tbl_uvregression(method = lm,
                   y = BMI_Changes,
                   include = PA_HW)
```
Characteristic N Beta 95% CI1 p-value
Physical Activity (total hour per week) 150 -0.31 -0.35, -0.28 <0.001
1 CI = Confidence Interval
Code
```{r}
asthmads %>% 
  tbl_uvregression(method = lm,
                   y = BMI_Changes,
                   include = c(Gender, Age, WorkStatus, BMI_Pre, PA_HW),
                   pvalue_fun = partial(style_pvalue, digits = 3)) %>% 
  bold_p()
```
Characteristic N Beta 95% CI1 p-value
Gender 150


    Female

    Male
0.23 -0.05, 0.52 0.108
Age (year) 150 0.01 -0.04, 0.06 0.609
Employment 150


    Unemployed

    Employed
0.02 -0.26, 0.31 0.884
BMI_Pre 150 -0.02 -0.04, 0.01 0.240
Physical Activity (total hour per week) 150 -0.31 -0.35, -0.28 <0.001
1 CI = Confidence Interval
Code
```{r}
library(broom)

slm_PA_HW %>% 
  tidy()
```
  1. Multiple Linear Regression
Code
```{r}
mlm_bmichanges <- lm(BMI_Changes ~ Gender + BMI_Pre + PA_HW, asthmads)
summary(mlm_bmichanges)
plot(mlm_bmichanges)
```

Call:
lm(formula = BMI_Changes ~ Gender + BMI_Pre + PA_HW, data = asthmads)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.24462 -0.33420  0.01937  0.32699  1.30192 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.631072   0.232216  -7.024 7.59e-11 ***
GenderMale   0.109515   0.086543   1.265    0.208    
BMI_Pre     -0.010316   0.008107  -1.273    0.205    
PA_HW       -0.311135   0.018787 -16.561  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5168 on 146 degrees of freedom
Multiple R-squared:  0.6632,    Adjusted R-squared:  0.6563 
F-statistic: 95.83 on 3 and 146 DF,  p-value: < 2.2e-16

4.4 Package gtsummary::

Code
```{r}
mlm_bmichanges %>%
  tbl_regression(pvalue_fun = partial(style_pvalue, digits = 3)) %>% 
  bold_p()
```
Characteristic Beta 95% CI1 p-value
Gender


    Female
    Male 0.11 -0.06, 0.28 0.208
BMI_Pre -0.01 -0.03, 0.01 0.205
Physical Activity (total hour per week) -0.31 -0.35, -0.27 <0.001
1 CI = Confidence Interval

5 Logistic Regression

5.1 Background

  • test for significant association between predictors with outcome
    • outcome is binary categorical variable. e.g., yes/no
    • predictor can be categorical or numerical variables
  • test’s assumptions
    • binary outcome
    • linearity of logit (log odds of the outcome)
    • independence
    • no multicollinearity (multiple logistic regression)
    • no strong outlier
    • goodness of fit

5.2 Conduct Logistic Regression

Objective: Find significant predictors of BMI_CCat (Changes in category) - PA_HW convert to PA_Cat, with cut off >= 2 for low and high - BMI_Changes convert to BMI_CCat, with cut off >= -2.5 for no and yes

  1. import dataset asthmads_clean.sav
  • same dataset used previously
Code
```{r}
asthmads
```
Code
```{r}
asthmads <- asthmads %>% 
  mutate(PA_Cat = cut(PA_HW, 
                      breaks = c(-Inf, 2.0, Inf), 
                      labels = c("Low Intensity", "High Intensity")),
         BMI_CCat = cut(BMI_Changes, 
                        breaks = c(-Inf, -2.50, Inf), 
                        labels = c("Marked Changed", "Min Changed")),
         BMI_CCat = fct_relevel(BMI_CCat, "Min Changed", "Marked Changed"))

asthmads
```
  1. Conduct Logistic Regression (Simple Logistic Regression)
Code
```{r}
slogm_PA <- glm(BMI_CCat ~ PA_Cat, 
                family = binomial(), 
                asthmads)
summary(slogm_PA)

tidy(slogm_PA, conf.int = T,  exponentiate = T)
```

Call:
glm(formula = BMI_CCat ~ PA_Cat, family = binomial(), data = asthmads)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -0.5691     0.2285  -2.491   0.0127 *  
PA_CatHigh Intensity   2.3096     0.4120   5.606 2.07e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 204.09  on 149  degrees of freedom
Residual deviance: 165.07  on 148  degrees of freedom
AIC: 169.07

Number of Fisher Scoring iterations: 4
  1. Test Assumption
Code
```{r}
plot(slogm_PA)
```

  • use broom:: package
    • function augment(_) to calculate residual and cook distance
Code
```{r}
library(broom)

slogm_PA %>% 
  augment() %>% 
  bind_cols(idR = asthmads$idR, .)
```
  • large residual
    • common cut off point 2 or 3 (absolute)
Code
```{r}
slogm_PA %>% 
  augment() %>%
  mutate(abs_stdresid = abs(.std.resid)) %>% 
  slice_max(order_by = abs_stdresid, n = 5)
```
  • large influential
    • traditional rule of thumb, cook distance >1
    • 4/n, where n is there number of observation
    • top percentage
Code
```{r}
slogm_PA %>% 
  augment() %>% 
  bind_cols(idR = asthmads$idR, .) %>% 
  slice_max(order_by = .cooksd, n = 5)
```
  1. Model Fitness - Hosmer Lemeshow Test
  • grouped the dataset into deciles, based on predicted probability
  • compare the number of obeserved and expected outcome in each group
  • non-significant indicate good fit
  • use hoslem.test from ResourceSelection:: package
  • need to change outcome to 0 and 1
Code
```{r}
library(ResourceSelection)

hoslem.test(x = as.numeric(asthmads$BMI_CCat)-1, 
            y = slogm_PA$fitted.values, 
            g = 9)
```
Warning in hoslem.test(x = as.numeric(asthmads$BMI_CCat) - 1, y =
slogm_PA$fitted.values, : The data did not allow for the requested number of
bins.

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  as.numeric(asthmads$BMI_CCat) - 1, slogm_PA$fitted.values
X-squared = 1.1517e-26, df = 0, p-value < 2.2e-16
  1. Model Fitness - Area Under ROC Curve
  • plot the diagnostic ability, as its discrimination threshold is varied
  • i.e., plot the Sensitivity against 1-Specificity, at various threshold setting
  • close to 1 indicate good fit
Code
```{r}
library(pROC)

slogm_PA %>% 
  augment(type.predict = "response") %>% 
  roc(BMI_CCat, .fitted)
```
Setting levels: control = Min Changed, case = Marked Changed
Setting direction: controls < cases

Call:
roc.data.frame(data = ., response = BMI_CCat, predictor = .fitted)

Data: .fitted in 63 controls (BMI_CCat Min Changed) < 87 cases (BMI_CCat Marked Changed).
Area under the curve: 0.7482
Code
```{r}
slogm_PA %>% 
  augment(type.predict = "response") %>% 
  roc(BMI_CCat, .fitted) %>% 
  plot()
```
Setting levels: control = Min Changed, case = Marked Changed
Setting direction: controls < cases

we can also plot manually with ggplot

Code
```{r}
slogm_PA_roc <- slogm_PA %>% 
  augment(type.predict = "response") %>% 
  roc(BMI_CCat, .fitted)
```
Setting levels: control = Min Changed, case = Marked Changed
Setting direction: controls < cases
Code
```{r}
tibble(threshold = slogm_PA_roc$thresholds,
       sensitivity = slogm_PA_roc$sensitivities, 
       specificity = slogm_PA_roc$specificities)
```
  1. Confusion Matrix
  • cross tabulate observed value and predicted value
  • Derived Metrics
    • accuracy: overall correct classifier (TP+TN)/total
    • sensitivity
    • specificity
    • precision
Code
```{r}
library(caret)

slog_PA_predds <- slogm_PA %>% 
  augment(type.predict = "response") %>% 
  mutate(predicted = cut(.fitted, 
                         breaks = c(-Inf, .50, Inf), 
                         labels = c("Min Changed", "Marked Changed"))) %>% 
  select(observed = BMI_CCat, predicted)
slog_PA_predds
```
Code
```{r}
slog_PA_ctab <- with(slog_PA_predds, table(observed, predicted))
slog_PA_ctab
```
                predicted
observed         Min Changed Marked Changed
  Min Changed             53             10
  Marked Changed          30             57
Code
```{r}
confusionMatrix(slog_PA_ctab, positive = "Marked Changed")
```
Confusion Matrix and Statistics

                predicted
observed         Min Changed Marked Changed
  Min Changed             53             10
  Marked Changed          30             57
                                          
               Accuracy : 0.7333          
                 95% CI : (0.6551, 0.8022)
    No Information Rate : 0.5533          
    P-Value [Acc > NIR] : 4.173e-06       
                                          
                  Kappa : 0.4756          
                                          
 Mcnemar's Test P-Value : 0.002663        
                                          
            Sensitivity : 0.8507          
            Specificity : 0.6386          
         Pos Pred Value : 0.6552          
         Neg Pred Value : 0.8413          
             Prevalence : 0.4467          
         Detection Rate : 0.3800          
   Detection Prevalence : 0.5800          
      Balanced Accuracy : 0.7447          
                                          
       'Positive' Class : Marked Changed  
                                          

5.3 Package gtsummary::

Code
```{r}
asthmads %>% 
  tbl_uvregression(method = glm, 
                   method.args = list(family = binomial), 
                   y = BMI_CCat,
                   include = PA_Cat,
                   exponentiate = T)
```
Characteristic N OR1 95% CI1 p-value
PA_Cat 150


    Low Intensity

    High Intensity
10.1 4.65, 23.6 <0.001
1 OR = Odds Ratio, CI = Confidence Interval

5.4 Multiple Logistic Regression

5.4.1 Create Model

Code
```{r}
asthmads %>% 
  tbl_uvregression(method = glm, 
                   method.args = list(family = binomial), 
                   y = BMI_CCat,
                   include = c(Gender, Age, WorkStatus, BMI_Pre, 
                               PA_HW, PA_Cat),
                   exponentiate = T) %>% 
  bold_p(t = .25)
```
Characteristic N OR1 95% CI1 p-value
Gender 150


    Female

    Male
0.51 0.26, 0.99 0.050
Age (year) 150 0.98 0.87, 1.09 0.6
Employment 150


    Unemployed

    Employed
1.04 0.55, 2.01 0.9
BMI_Pre 150 1.06 0.99, 1.13 0.078
Physical Activity (total hour per week) 150 2.60 1.92, 3.73 <0.001
PA_Cat 150


    Low Intensity

    High Intensity
10.1 4.65, 23.6 <0.001
1 OR = Odds Ratio, CI = Confidence Interval
Code
```{r}
mlogm_bmic <- glm(BMI_CCat ~ Gender + Age + PA_HW, 
                family = binomial(), 
                asthmads)

mlogm_bmic %>% 
  tbl_regression(exponentiate = T, 
                 estimate_fun = partial(style_ratio, digits = 3), 
                 pvalue_fun = partial(style_pvalue, digits = 2)) %>% 
  bold_p()
```
Characteristic OR1 95% CI1 p-value
Gender


    Female
    Male 0.541 0.231, 1.242 0.15
Age (year) 1.009 0.877, 1.162 0.90
Physical Activity (total hour per week) 2.607 1.919, 3.764 <0.001
1 OR = Odds Ratio, CI = Confidence Interval

5.4.2 Model Assumption Test

Code
```{r}
plot(mlogm_bmic)
```

residual

Code
```{r}
mlogm_bmic %>% 
  augment() %>%
  bind_cols(id = asthmads$id, .) %>% 
  mutate(abs_stdresid = abs(.std.resid)) %>% 
  slice_max(order_by = abs_stdresid, n = 5)
```

influential

Code
```{r}
mlogm_bmic %>% 
  augment() %>%
  bind_cols(id = asthmads$id, .) %>% 
  slice_max(order_by = .cooksd, n = 5)
```

5.4.3 Model Fitness

HL Test

Code
```{r}
library(ResourceSelection)

hoslem.test(x = as.numeric(asthmads$BMI_CCat)-1, 
            y = mlogm_bmic$fitted.values, 
            g = 10)
```

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  as.numeric(asthmads$BMI_CCat) - 1, mlogm_bmic$fitted.values
X-squared = 8.8943, df = 8, p-value = 0.3513

ROC Curve

Code
```{r}
library(pROC)

mlogm_bmic %>% 
  augment(type.predict = "response") %>% 
  roc(BMI_CCat, .fitted)
```
Setting levels: control = Min Changed, case = Marked Changed
Setting direction: controls < cases
Code
```{r}
mlogm_bmic %>% 
  augment(type.predict = "response") %>% 
  roc(BMI_CCat, .fitted) %>% 
  plot()
```
Setting levels: control = Min Changed, case = Marked Changed
Setting direction: controls < cases

Call:
roc.data.frame(data = ., response = BMI_CCat, predictor = .fitted)

Data: .fitted in 63 controls (BMI_CCat Min Changed) < 87 cases (BMI_CCat Marked Changed).
Area under the curve: 0.8574

Confusion Matrix

Code
```{r}
library(caret)

mlogm_bmic_predds <- mlogm_bmic %>% 
  augment(type.predict = "response") %>% 
  mutate(predicted = cut(.fitted, 
                         breaks = c(-Inf, .50, Inf), 
                         labels = c("Min Changed", "Marked Changed"))) %>% 
  select(observed = BMI_CCat, predicted)

mlogm_bmic_ctab <- with(mlogm_bmic_predds, table(observed, predicted))

confusionMatrix(mlogm_bmic_ctab, positive = "Marked Changed")
```
Confusion Matrix and Statistics

                predicted
observed         Min Changed Marked Changed
  Min Changed             48             15
  Marked Changed          18             69
                                          
               Accuracy : 0.78            
                 95% CI : (0.7051, 0.8435)
    No Information Rate : 0.56            
    P-Value [Acc > NIR] : 1.506e-08       
                                          
                  Kappa : 0.5514          
                                          
 Mcnemar's Test P-Value : 0.7277          
                                          
            Sensitivity : 0.8214          
            Specificity : 0.7273          
         Pos Pred Value : 0.7931          
         Neg Pred Value : 0.7619          
             Prevalence : 0.5600          
         Detection Rate : 0.4600          
   Detection Prevalence : 0.5800          
      Balanced Accuracy : 0.7744          
                                          
       'Positive' Class : Marked Changed